Conversation
|
Stefano,
I now realize what the problem is with the comma. The Tifinagh script has a separator character that I learned was used like a comma, so I mapped it to the Latin comma. Unfortunately, the languages that use Tifinagh script also use regular Latin punctuation marks, so I will need to map the special Tifinagh separator to something unique in the Latin script, perhaps the Latin comma with a double underscore (\u0333).
The other thing you mentioned, the possible problem with other characters, is a font problem. Combinations of Tifinagh characters with combining characters aren’t displaying correctly in MS Excel. The underlying character values are correct, fortunately. I will fix the mapping of the Tifinagh separator and send you a new “tifinagh_generic.yml” file as well as test data.
Randy
…----------------------------------------
Randall K. Barry ᠷᠠᠨ᠋ᠳ᠋ᠠᠯᠯ ᠺ᠊ ᠪᠠᠷᠷᠶ
Email: ***@***.***> ***@***.***
Mobile: +1-703-244-1232
629 24th St. South
Arlington, VA 22202-2525 U.S.A.
From: Stefano Cossu ***@***.***>
Sent: Saturday, April 4, 2026 20:19
To: lcnetdev/scriptshifter ***@***.***>
Cc: Randall K. Barry ***@***.***>; Mention ***@***.***>
Subject: [lcnetdev/scriptshifter] Add Tifinagh languages and tests. (PR #292)
@RandyBarry <https://github.com/RandyBarry> see attached test results. There seems to be a problem transliterating the comma character, and maybe a couple of other characters that I can't tell right now if it's an incorrect test pair or an incorrect mapping.
tamashek.csv <https://github.com/user-attachments/files/26485324/tamashek.csv>
tamazight_moroccan.csv <https://github.com/user-attachments/files/26485325/tamazight_moroccan.csv>
tifinagh_generic.csv <https://github.com/user-attachments/files/26485326/tifinagh_generic.csv>
_____
You can view, comment on, or merge this pull request online at:
#292
Commit Summary
* 721a356 <721a356> Add Tifinagh languages and tests.
File Changes
(6 <https://github.com/lcnetdev/scriptshifter/pull/292/files> files)
* A scriptshifter/tables/data/tamashek.yml <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-e6383418c2f60e43b544c9fdf10b2a2875967c63d58688c5e02e681da246a769> (112)
* A scriptshifter/tables/data/tifinagh_generic.yml <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-72f7ae57cf635845e9a0d8e87cf19f8c1aaf1867134e9204ec19f909e7aa113e> (170)
* M scriptshifter/tables/index.yml <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-09f5531e9c25fa16448edb4eccda16f6bc0062c6da1ed21b4af3cfccbae1c991> (18)
* A test/data/script_samples/tamashek.csv <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-3c8487e4f754b4a86aa543bfc207e491831065943a4c64f0a87d9226153e3cec> (1)
* A test/data/script_samples/tamazight_moroccan.csv <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-446fd01bc3599a87a7f306d934979b6491c02980fdedd2767fddef9bea1ab7ad> (2)
* A test/data/script_samples/tifinagh_generic.csv <https://github.com/lcnetdev/scriptshifter/pull/292/files#diff-b215289bfb1e5752ee5d309d273c18d11bab80168a73016d186293a81ae66212> (5)
Patch Links:
* https://github.com/lcnetdev/scriptshifter/pull/292.patch
* https://github.com/lcnetdev/scriptshifter/pull/292.diff
|
|
I have updated the mappings and the tests. There are still a few outstanding issues: |
|
What are the outstanding issues in the mappings for Tifinagh script and/or the languages that use it? I don't see the before/after comparisons. |
|
Sorry, I had attached your test files instead of the reports. Here are the correct files: test_tamashek.log. As you can see, some of the issues are related to the comma + combining underscore we discussed over email. E.g. "Imdanen, akken" transliterates to "ⵉⵎⴷⴰⵏⴻⵏ, ⴰⴽⴽⴻⵏ" instead of "ⵉⵎⴷⴰⵏⴻⵏ⵰ ⴰⴽⴽⴻⵏ" because the lone comma is not mapped; only the comma followed by a combining underscore is. Conversely, "ⵉⵎⴷⴰⵏⴻⵏ⵰ ⴰⴽⴽⴻⵏ" transliterates into " Imdanen,̲ akken" instead of the expected "Imdanen, akken". |
|
Stefano: an unconverted lone comma is okay. Users of the Tifinagh script use Latin punctuation, like the regular comma, in addition to the special Tifinagh separator. A regular comma should remain a comma, whereas the comma with low line converts to the Tifinagh separator. We need to test both the regular comma AND the special separator. Both seem to be fine now. |
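The rule described in this comment can be sketched in Python. This is only an illustration under stated assumptions, not ScriptShifter's implementation or table format: the two dicts and the `transliterate` helper are hypothetical, the tables contain nothing but the separator rule, and every unmapped character is passed through unchanged. The separator is assumed to be U+2D70 (TIFINAGH SEPARATOR MARK).

```python
# Hypothetical, minimal tables: only the separator rule discussed in this thread.
COMMA = "\u002C"
LOW_LINE = "\u0332"    # COMBINING LOW LINE
SEPARATOR = "\u2D70"   # TIFINAGH SEPARATOR MARK (assumed to be the separator meant here)

ROMAN_TO_SCRIPT = {COMMA + LOW_LINE: SEPARATOR}
SCRIPT_TO_ROMAN = {SEPARATOR: COMMA + LOW_LINE}


def transliterate(text, table):
    """Greedy longest-match replacement; unmapped characters pass through."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here, so copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


plain = "a, b"                    # regular Latin comma
marked = "a," + LOW_LINE + " b"   # comma + combining low line

print(transliterate(plain, ROMAN_TO_SCRIPT) == plain)
print(transliterate(marked, ROMAN_TO_SCRIPT) == "a" + SEPARATOR + " b")
```

Because the lone comma has no table entry, it survives in both directions, while the two-character sequence round-trips to the separator and back.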
|
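@RandyBarry Just to confirm, currently in my terminal (but I think it's a common behavior) the sequence \u002C\u0332 displays a simple comma + the combining underscore, which underscores the following character, not the comma. You are referring to an underlined comma, so I wonder if you intended to use \u0332\u002C (combined underlined comma) in the transliteration table. If the transliteration table is correct, then I have to adjust the test strings so that "ⵉⵎⴷⴰⵏⴻⵏ⵰ ⴰⴽⴽⴻⵏ" transliterates to "Imdanen, ̲akken". |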
|
Stefano,
I know it’s an odd combination to have a combining low line under (after) a comma, but the order is dictated by Unicode. Combining characters are supposed to be encoded AFTER the character that they modify. If the combining low line were encoded first, it would be interpreted as modifying whatever character preceded it, which is NOT the intent for this script. In MARC, a combining character is used to differentiate a spacing character, such as an alphabetic letter or a spacing mark, so that the result is related but different. A similar case is the uppercase soft or hard sign, both of which are normally romanized to the same spacing marks (prime or double prime, respectively). In order to map the uppercase Cyrillic letters to something unique and reversible, I used the double low line after the prime or double prime. Does this make my use of the combining low line after the Latin comma more understandable as a unique combination that maps to the special Tifinagh separator?
I hope I’m not missing something here. In my testing, Tifinagh strings with either the normal Latin comma or the special Tifinagh separator convert as I expect: comma stays a comma, even when appearing in Tifinagh script, and comma+combining low line converts to the special Tifinagh separator. Uses of the script are not consistent. From what I’ve read, they sometimes use the special separator to end sections of text. It could be thought of as various Latin punctuation marks but seems closest to a comma, so I mapped it to comma+combining low line. I could have mapped it to period+combining low line. I even considered mapping it to the solidus “/” (forward slash mark), but it is most often used interchangeably with a comma.
Randy Barry - ᠷᠧᠨᠳᠢ ***@***.***>
On Apr 19, 2026, at 11:52, Stefano Cossu ***@***.***> wrote:
scossu left a comment (lcnetdev/scriptshifter#292)
@RandyBarry Just to confirm, currently my terminal (but I think it's a common behavior) the sequence \u002C\u0332 displays a simple comma + the combining underscore, which underscores the following character, not the comma. You are referring to an underlined comma, so I wonder if you intended to use \u0332\u002C (combined underlined comma) in the transliteration table.
If the transliteration table is correct, then I have to adjust the test strings so that "ⵉⵎⴷⴰⵏⴻⵏ⵰ ⴰⴽⴽⴻⵏ" transliterates to "Imdanen, ̲ akken".
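The encoding-order question raised here can be checked with Python's standard `unicodedata` module. The sketch below is an illustration only: the `clusters` helper is hypothetical and uses a simplified rule (attach every mark with a non-zero combining class to the preceding character), which is not full Unicode grapheme segmentation and says nothing about how a particular terminal or font renders the sequence.

```python
import unicodedata

COMMA = "\u002C"
LOW_LINE = "\u0332"  # COMBINING LOW LINE

# Inspect the two code points in the order used by the mapping table.
for ch in COMMA + LOW_LINE:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


def clusters(text):
    """Group each combining mark with the character encoded before it.

    Simplified: a mark is any character with a non-zero canonical
    combining class. A mark at the start of the text has no base and
    forms a cluster of its own.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch
        else:
            result.append(ch)
    return result


print(clusters("a" + COMMA + LOW_LINE + " b"))   # mark grouped with the comma
print(clusters("a" + LOW_LINE + COMMA + " b"))   # mark grouped with the "a"
```

In the second call the low line is encoded before the comma, so by this rule it attaches to the preceding letter rather than to the comma.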
|
|
Just a few more tests failing: test_tamashek.log |
|
Stefano,
It’s difficult to verify conversion problems due to the alignment of the error markers in the test log. I understand why it looks like there are conversion errors. I will need to segregate the situations which are being flagged as errors to get around this problem.
The issue is this: in the ALA-LC romanization tables, three Tifinagh letters can each be romanized in two ways. The roman-to-script mapping handles those options by accepting two different Latin transliterations for each letter. A perfect example is the Tifinagh letter “ⵐ”, which can be transliterated as “ŋ” or “n̳”. The first option is a new Unicode Latin character that was never used in MARC before. The Unicode character “ŋ” is now the preferred romanization for “ⵐ”, but librarians can use the other option if their local system doesn’t support “ŋ” yet. The new letter was used in the new ALA-LC Romanization Table on the assumption that any system that can support Tifinagh script should be able to support the Latin character “ŋ”. I tend to agree. ScriptShifter needs to know what to do with both the new-style romanizations, which use the new Latin characters, and the old-style ones.
The problem arises when ScriptShifter does script-to-roman conversion of data that included both options in Latin. ScriptShifter has only one mapping to Latin for each Tifinagh character. For “ⵐ”, where the Latin input file had two different representations, there is only one Latin transliteration in the output: “ŋ (ŋ)”. The “(n̳)” originally seen has been replaced with “(ŋ)”. I put parentheses around the alternative romanizations so that I could see what happened to them when they passed through ScriptShifter in a round trip. What you are seeing in the test confirms that ScriptShifter is actually handling these optional transliterations properly. Where you see “ə (ə)”, “ŋ (ŋ)” and “ɣ (ɣ)” in Latin output that started out as “ə (e̳)”, “ŋ (n̳)” and “ɣ (g̳)”, those are cases where the Tifinagh letters had two Latin transliteration options. There’s no way to tell ScriptShifter to output the old (less desirable) romanization with the double low line under the regular Latin letter.
What I could do is put optional transliterations like this in a separate test line and flag it as roman-to-script only. That way they would not confuse the otherwise good conversion results. A test string with:
ə (e̳) ŋ (n̳) ɣ (g̳) would convert to ⵓ (ⵓ) ⵐ (ⵐ) ⵘ (ⵘ). The reverse would be (correctly!): ə (ə) ŋ (ŋ) ɣ (ɣ)
Does this make sense? In my own testing, I knew where these one-way conversions were, so I ignored the differences. I used the parentheses in the Tamasheq test data to flag them as special but didn’t bother with the parentheses in the two other test sets. I should have put the parentheses around the alternative romanizations in all three test sets to flag these letters as special in romanization, or I should have separated them out onto a separate test line. I’m sorry I added this confusion to the test data. The bottom line is, the conversions are working fine. I can send you revised test data to separate out the alternatives onto separate lines, if you want.
Randy
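The one-way alternatives described in this message can be sketched as follows. Everything here is a hypothetical illustration, not ScriptShifter's table format or test harness: the dicts, the `transliterate` helper, and the `reversible` flag on each test pair are assumptions, and only the “ⵐ” example from the message is modeled.

```python
# Two Latin inputs map to the same Tifinagh letter; only one maps back.
ROMAN_TO_SCRIPT = {
    "\u014B": "ⵐ",     # ŋ, the preferred romanization
    "n\u0333": "ⵐ",    # n + COMBINING DOUBLE LOW LINE, the older alternative
}
SCRIPT_TO_ROMAN = {"ⵐ": "\u014B"}


def transliterate(text, table):
    """Greedy longest-match replacement; unmapped characters pass through."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# (roman, script, reversible): the alternative is flagged roman-to-script only.
TEST_PAIRS = [
    ("\u014B", "ⵐ", True),
    ("(n\u0333)", "(ⵐ)", False),
]


def check(pairs):
    """Return a list of failures; the reverse check is skipped for one-way pairs."""
    failures = []
    for roman, script, reversible in pairs:
        if transliterate(roman, ROMAN_TO_SCRIPT) != script:
            failures.append(("roman-to-script", roman))
        if reversible and transliterate(script, SCRIPT_TO_ROMAN) != roman:
            failures.append(("script-to-roman", script))
    return failures


print(check(TEST_PAIRS))
```

Keeping the alternative on its own line with the reverse check disabled means a round trip that legitimately returns the preferred form is not reported as an error.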
From: Stefano Cossu ***@***.***>
Sent: Sunday, April 19, 2026 19:48
To: lcnetdev/scriptshifter ***@***.***>
Cc: Randall K. Barry ***@***.***>; Mention ***@***.***>
Subject: Re: [lcnetdev/scriptshifter] Add Tifinagh languages and tests. (PR #292)
scossu left a comment (lcnetdev/scriptshifter#292)
Just a few more tests failing:
test_tamashek.log <https://github.com/user-attachments/files/26878242/test_tamashek.log>
test_tamazight_moroccan.log <https://github.com/user-attachments/files/26878243/test_tamazight_moroccan.log>
test_tifinagh_generic.log <https://github.com/user-attachments/files/26878244/test_tifinagh_generic.log>
|

